[WIP] Fix Hyperdisk ControllerExpandVolume Edge Cases #1899

Open: sunnylovestiramisu wants to merge 5 commits into master

@sunnylovestiramisu sunnylovestiramisu commented Jan 2, 2025

What type of PR is this?
/kind bug

What this PR does / why we need it:
A 4Gi Hyperdisk requires exactly 2000 IOPS and a 5Gi Hyperdisk requires exactly 2500 IOPS. Disks of 6Gi or more require a minimum of 3000 IOPS. Resizing a Hyperdisk across these sizes therefore also requires updating its IOPS. ControllerExpandVolume currently fails for these edge cases.
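
The size thresholds above can be written down as a small helper. This is a minimal sketch rather than the driver's code: `iopsForSize` is a hypothetical name, the numbers are only the ones quoted in this description, and sizes below 4Gi are passed through unchanged because the description does not cover them.

```go
package main

import "fmt"

// iopsForSize returns the IOPS value to request when a Hyperdisk is resized
// to sizeGiB, given the IOPS it is currently provisioned with.
// Hypothetical helper: the thresholds come from the PR description only.
func iopsForSize(sizeGiB, currentIOPS int64) int64 {
	switch {
	case sizeGiB == 4:
		// 4Gi needs exactly 2000 IOPS.
		return 2000
	case sizeGiB == 5:
		// 5Gi needs exactly 2500 IOPS.
		return 2500
	case sizeGiB >= 6 && currentIOPS < 3000:
		// 6Gi and larger need at least 3000 IOPS.
		return 3000
	default:
		// Already valid, or a size this sketch does not cover.
		return currentIOPS
	}
}

func main() {
	// A 4Gi disk provisioned with 2000 IOPS, expanded to 5Gi and then 6Gi.
	fmt.Println(iopsForSize(5, 2000))
	fmt.Println(iopsForSize(6, 2500))
}
```

A resize from 4Gi to 5Gi or 6Gi is exactly the case where the size change alone is not enough and the IOPS value has to move with it.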

The GCE disk Update API does not support resizing async PDs today. Even though the PDCSI driver does not currently support resizing sync PDs, to keep this a two-way door the proposal is:

  • Use the GCE Resize API for PD disks in ControllerExpandVolume
  • Use the GCE Update API for Hyperdisks in ControllerExpandVolume
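
The proposed split can be sketched as a dispatch on the disk type. This is an illustration under stated assumptions: `apiForDiskType` and the `expandAPI` type are hypothetical names, the returned strings are just labels for the two GCE methods, and matching on the `hyperdisk-` prefix is an assumption about how the disk type is recognized.

```go
package main

import (
	"fmt"
	"strings"
)

// expandAPI labels which GCE disk method an expand request would use.
type expandAPI string

const (
	resizeAPI expandAPI = "disks.resize" // size-only change
	updateAPI expandAPI = "disks.update" // can change size together with IOPS
)

// apiForDiskType picks the method for ControllerExpandVolume.
// Hypothetical helper: Hyperdisks go through Update, all other PDs through Resize.
func apiForDiskType(diskType string) expandAPI {
	if strings.HasPrefix(diskType, "hyperdisk-") {
		return updateAPI
	}
	return resizeAPI
}

func main() {
	for _, t := range []string{"hyperdisk-balanced", "pd-balanced"} {
		fmt.Printf("%s -> %s\n", t, apiForDiskType(t))
	}
}
```

Keeping existing PD types on Resize leaves their behavior unchanged, which is what makes the change reversible.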

Testing:

  1. Bring up a kubetest cluster on a machine type that supports Hyperdisk: NUM_NODES=1 NODE_SIZE="n4-standard-4" MASTER_SIZE="n4-standard-4" NODE_DISK_TYPE="hyperdisk-balanced" kubetest --up
  2. Deploy the PDCSI driver with this change
  3. Create the PVC, StorageClass, and Pod:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-demo
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 4Gi
  storageClassName: hyperdisk-balanced-sc
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: hyperdisk-balanced-sc
provisioner: pd.csi.storage.gke.io
parameters:
  type: hyperdisk-balanced
  provisioned-iops-on-create: "2000"
  provisioned-throughput-on-create: "200Mi"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
kind: Pod
apiVersion: v1
metadata:
  name: pod-demo
spec:
  volumes:
    - name: pvc-demo-vol
      persistentVolumeClaim:
        claimName: pvc-demo
  containers:
    - name: pod-demo
      image: nginx
      resources:
        limits:
          cpu: 10m
          memory: 80Mi
        requests:
          cpu: 10m
          memory: 80Mi
      ports:
        - containerPort: 80
          name: "http-server"
      volumeMounts:
        - mountPath: "/usr/share/nginx/html"
          name: pvc-demo-vol

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Fix Hyperdisk ControllerExpandVolume edge cases for 4, 5, and 6 Gi resizes

@k8s-ci-robot added the release-note, do-not-merge/work-in-progress, kind/bug, and cncf-cla: yes labels on Jan 2, 2025
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sunnylovestiramisu


@k8s-ci-robot added the approved label on Jan 2, 2025
@k8s-ci-robot added the size/S label (10-29 changed lines, ignoring generated files) on Jan 3, 2025
@sunnylovestiramisu

/test pull-gcp-compute-persistent-disk-csi-driver-unit

mattcary commented Jan 3, 2025

Can you expand on the release notes with the specific sizes (the 4, 5 and 6 G thresholds that you mention above)? It's handy to have those specifics in the release notes that go out with the new version.

@@ -134,6 +134,7 @@ type ParameterProcessor struct {
 type ModifyVolumeParameters struct {
 	IOPS       *int64
 	Throughput *int64
+	SizeGb     *int64

As discussed offline, won't doing a size change in modify volume cause problems, since there's no node RPC to update the filesystem?

Oh, I see, you're just using this in controller expand volume.

I think we at least need a comment that this is only used in that RPC so there's no future confusion.
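
One possible shape for the requested comment, shown on a standalone copy of the struct. The comment wording is only a suggestion, and the `main` function is there just to make the sketch runnable.

```go
package main

import "fmt"

// ModifyVolumeParameters mirrors the struct in the diff above.
type ModifyVolumeParameters struct {
	IOPS       *int64
	Throughput *int64
	// SizeGb is set only by ControllerExpandVolume. ControllerModifyVolume
	// must leave it nil: there is no node RPC after a modify call to grow
	// the filesystem, so a size change there would not reach the filesystem.
	SizeGb *int64
}

func main() {
	// An expand request that changes only the size.
	size := int64(6)
	p := ModifyVolumeParameters{SizeGb: &size}
	fmt.Println(p.IOPS == nil, p.Throughput == nil, *p.SizeGb)
}
```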

@sunnylovestiramisu sunnylovestiramisu Jan 3, 2025

I also need to change the current implementation of the GCE Update Disk call. Right now it requires the request to pass in at least one of IOPS or throughput, whereas the API can take sizeGb without any IOPS or throughput. We will move the IOPS/throughput check into the actual ControllerModifyVolume func.
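
A rough sketch of what moving the check could look like, assuming the struct from the diff. The function names `validateModifyVolume` and `validateDiskUpdate` and both error values are hypothetical and do not come from the driver.

```go
package main

import (
	"errors"
	"fmt"
)

// ModifyVolumeParameters mirrors the struct in the diff above.
type ModifyVolumeParameters struct {
	IOPS       *int64
	Throughput *int64
	SizeGb     *int64
}

var (
	errNoPerfParams    = errors.New("at least one of IOPS or throughput must be set")
	errNothingToUpdate = errors.New("no disk field to update")
)

// validateModifyVolume is the stricter check, kept in ControllerModifyVolume:
// a modify request must change IOPS or throughput.
func validateModifyVolume(p ModifyVolumeParameters) error {
	if p.IOPS == nil && p.Throughput == nil {
		return errNoPerfParams
	}
	return nil
}

// validateDiskUpdate is the relaxed check for the shared update path:
// a size-only update (as sent by ControllerExpandVolume) is accepted.
func validateDiskUpdate(p ModifyVolumeParameters) error {
	if p.IOPS == nil && p.Throughput == nil && p.SizeGb == nil {
		return errNothingToUpdate
	}
	return nil
}

func main() {
	size := int64(6)
	sizeOnly := ModifyVolumeParameters{SizeGb: &size}
	fmt.Println("update path:", validateDiskUpdate(sizeOnly))
	fmt.Println("modify path:", validateModifyVolume(sizeOnly))
}
```

With this split, a size-only request passes the shared update path while ControllerModifyVolume still rejects requests that carry neither IOPS nor throughput.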

@sunnylovestiramisu sunnylovestiramisu Jan 3, 2025

Hmm, it turns out to be more complicated: we currently have resizeZonalDisk and resizeRegionDisk, but for update we only have updateZonalDisk.

@k8s-ci-robot added the size/M label (30-99 changed lines, ignoring generated files) and removed the size/S label on Jan 3, 2025
@sunnylovestiramisu force-pushed the fixResize branch 2 times, most recently from 98f1959 to 7350edf, on January 3, 2025 19:06
@sunnylovestiramisu

/test pull-gcp-compute-persistent-disk-csi-driver-e2e

@k8s-ci-robot added the size/L label (100-499 changed lines, ignoring generated files) and removed the size/M label on Jan 3, 2025
@sunnylovestiramisu force-pushed the fixResize branch 5 times, most recently from 68aef8d to 526deb2, on January 7, 2025 18:58
@sunnylovestiramisu force-pushed the fixResize branch 2 times, most recently from 88259a3 to b42f936, on January 7, 2025 19:47
@sunnylovestiramisu

/test pull-gcp-compute-persistent-disk-csi-driver-e2e

@k8s-ci-robot

@sunnylovestiramisu: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-gcp-compute-persistent-disk-csi-driver-e2e | f26d093 | link | true | /test pull-gcp-compute-persistent-disk-csi-driver-e2e |


Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
  • do-not-merge/work-in-progress: Indicates that a PR should not merge because it is a work in progress.
  • kind/bug: Categorizes issue or PR as related to a bug.
  • release-note: Denotes a PR that will be considered when it comes time to generate release notes.
  • size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.
3 participants